
Ultracomputer  Research  Laboratory 



New  York  University 

Courant  Institute  of  Mathematical  Sciences 

Division  of  Computer  Science 

251  Mercer  Street,  New  York,  NY  10012 


On  the  Editing  Distance  Between  Trees  and  Related  Problems 

by 
Kaizhong  Zhang 
Dennis  Shasha 

Ultracomputer  Note  122 

NYU  Computer  Science  Technical  Report  310 

August,  1987 


This  work  was  supported  in  part  by  the  National  Science  Foundation  under  grant  number 
DCR8501611,  by  the  Office  of  Naval  Research  under  grant  number  N00014-85-K-0046,  and  by  the 
Supercomputing  Research  Center. 


ABSTRACT 

Since  a  tree  can  represent  a  scene  description,  a  grammar  parse,  a  structural 
description,  and  many  other  phenomena,  comparing  trees  is  a  way  to  compare 
scenes,  parses  and  so  on.  We  consider  the  distance  between  two  trees  to  be  the 
(weighted)  number  of  edit  operations  (insert,  delete,  and  modify)  to  transform  one 
tree  to  another.    Then,  we  consider  the  following  kinds  of  questions: 

1.  What  is  the  distance  between  two  trees? 

2.  What is the minimum distance between T1 and T2 with an arbitrary subtree removed? What if zero or more subtrees can be removed from T1? A specialization of these algorithms solves the problem of approximate tree matching.

3.  Given k, are T1 and T2 within distance k of one another and if so what is their distance?

We present a postorder dynamic programming algorithm to solve question 1 in sequential time O(|T1| × |T2| × depth(T1) × depth(T2)) compared with O(|T1| × |T2| × (depth(T1))^2 × (depth(T2))^2) for the best previous algorithm due to Tai. Further, our algorithm can be parallelized to give time O(|T1| + |T2|). Our approach extends to answer question 2 giving an algorithm of the same complexity. For question 3, a variant of our distance algorithm yields an O(k^2 × min(|T1|,|T2|) × min(depth(T1),depth(T2))) algorithm.


1.    Editing  distance  between  two  trees 

Ordered  labeled  trees  are  trees  in  which  the  left-to-right  order  among  siblings  is  significant. 
The  distance  between  such  trees  is  a  generalization  of  the  distance  between  strings. 

The following definitions are basically from [T-79, WF-74]. Let T1 and T2 be two trees with N1 and N2 nodes respectively. Suppose that we have an ordering for each tree. T[i] means the ith node of tree T in the given ordering, and T[i..j] is the subtree of T whose nodes are numbered i to j inclusive. If i > j, then T[i..j] = ∅. An edit operation is a pair (a,b) ≠ (Λ,Λ) of strings of length less than or equal to 1 and is usually written as a → b. We call a → b a change operation if a ≠ Λ and b ≠ Λ; a delete operation if b = Λ; and an insert operation if a = Λ.

Let  us  consider  these  three  kinds  of  operations. 

1.  Change: To change one node label to another. (a → b)

2.  Delete: To delete a node. (All children of the deleted node b become children of the parent a.) (b → Λ)

3.  Insert: To insert a node. (A consecutive sequence of siblings among the children of a become the children of b.) (Λ → b)
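The three operations can be sketched on a small mutable tree. This is an illustration only: the `[label, children]` list representation and the helper names `change_label`, `delete_child`, and `insert_child` are assumptions made for this example and do not come from the text.

```python
# Illustrative sketch: a node is a two-element list [label, children].

def change_label(node, new_label):
    """Change: relabel a node in place (a -> b)."""
    node[0] = new_label

def delete_child(parent, k):
    """Delete: remove the k-th child of `parent`; the removed node's
    children take its place, in order, among the children of `parent`."""
    children = parent[1]
    removed = children[k]
    children[k:k + 1] = removed[1]

def insert_child(parent, start, stop, label):
    """Insert: add a node `label` under `parent` that adopts the
    consecutive siblings children[start:stop]."""
    children = parent[1]
    children[start:stop] = [[label, children[start:stop]]]

# f(d(a, c(b)), e)  ->  f(c(d(a, b)), e) in two operations.
tree = ["f", [["d", [["a", []], ["c", [["b", []]]]]], ["e", []]]]
d = tree[1][0]
delete_child(d, 1)             # delete c: b becomes a child of d
insert_child(tree, 0, 1, "c")  # insert c between f and d
```

Deleting the root is not handled by this sketch, since the root has no parent to adopt its children.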


Let S be a sequence s_1, ..., s_k of edit operations. An S-derivation from A to B is a sequence of trees A_0, ..., A_k such that A = A_0, B = A_k, and A_(i-1) → A_i via s_i for 1 ≤ i ≤ k.

Let γ be a cost function which assigns to each edit operation a → b a nonnegative real number γ(a → b). This cost can be different for different nodes, so it can be used to give greater weights to, for example, the higher nodes in a tree than to lower nodes.

We constrain γ to be a distance metric. That is,
i)  γ(a → a) = 0;
ii)  γ(a → b) = γ(b → a); and
iii)  γ(a → c) ≤ γ(a → b) + γ(b → c).

We extend γ to the sequence S by letting

$$\gamma(S) = \sum_{i=1}^{k} \gamma(s_i)$$

Formally the distance between T1 and T2 is defined as:

$$\delta(T_1, T_2) = \min \{ \gamma(S) \mid S \text{ is an edit operation sequence taking } T_1 \text{ to } T_2 \}$$

The definition of γ makes this a distance metric also.

The edit operations give rise to a mapping. Intuitively, a mapping is a description of how a sequence of edit operations transforms T1 into T2, ignoring the order in which edit operations are applied.




Consider the following diagram of a mapping:

[Figure: a mapping from T1 to T2, drawn as dotted lines between nodes.]

A dotted line from T1[i] to T2[j] indicates that T1[i] should be changed to T2[j] if T1[i] ≠ T2[j], or that T1[i] remains unchanged if T1[i] = T2[j]. The nodes of T1 not touched by a dotted line are to be deleted and the nodes of T2 not touched are to be inserted. The mapping shows a way to transform T1 to T2.

Formally we define a triple (M,T1,T2) to be a mapping from T1 to T2, where M is any set of pairs of integers (i,j) satisfying:*

(1)  1 ≤ i ≤ N1, 1 ≤ j ≤ N2;

(2)  For any pair of (i1,j1) and (i2,j2) in M,

(a)  i1 = i2 iff j1 = j2;

(b)  T1[i1] is to the left of T1[i2] iff T2[j1] is to the left of T2[j2];

(c)  T1[i1] is an ancestor of T1[i2] iff T2[j1] is an ancestor of T2[j2].

We will use M instead of (M,T1,T2) if there is no confusion. Let M be a mapping from T1 to T2. Let I and J be the sets of nodes in T1 and T2, respectively, not touched by any line in M. Then we can define the cost of M:

$$\gamma(M) = \sum_{(i,j) \in M} \gamma(T_1[i] \to T_2[j]) + \sum_{i \in I} \gamma(T_1[i] \to \Lambda) + \sum_{j \in J} \gamma(\Lambda \to T_2[j])$$
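The mapping conditions and the cost of a mapping can be checked mechanically. The following is a minimal sketch under several assumptions that are not in the text: nodes are numbered in postorder, `left[i]` holds the number of the leftmost leaf under node i (the quantity l(i) introduced in Section 2), costs are unit costs, and all function names are hypothetical. Only condition (2) is checked; the range condition (1) is taken for granted.

```python
# Illustrative sketch. Nodes are numbered in postorder from 1 (index 0 unused);
# left[i] is the number of the leftmost leaf under node i.

def is_ancestor(left, a, d):
    """True if node a is a proper ancestor of node d."""
    return left[a] <= d < a

def is_left_of(left, x, y):
    """True if node x lies entirely to the left of node y."""
    return x < left[y]

def is_mapping(pairs, left1, left2):
    """Check conditions (2a)-(2c) for every two pairs in M."""
    for i1, j1 in pairs:
        for i2, j2 in pairs:
            if (i1 == i2) != (j1 == j2):
                return False
            if is_left_of(left1, i1, i2) != is_left_of(left2, j1, j2):
                return False
            if is_ancestor(left1, i1, i2) != is_ancestor(left2, j1, j2):
                return False
    return True

def mapping_cost(pairs, labels1, labels2, cost):
    """gamma(M): changes for mapped pairs, deletions for untouched T1 nodes,
    insertions for untouched T2 nodes. None stands for the empty label."""
    touched1 = {i for i, _ in pairs}
    touched2 = {j for _, j in pairs}
    total = sum(cost(labels1[i], labels2[j]) for i, j in pairs)
    total += sum(cost(labels1[i], None)
                 for i in range(1, len(labels1)) if i not in touched1)
    total += sum(cost(None, labels2[j])
                 for j in range(1, len(labels2)) if j not in touched2)
    return total

def unit_cost(a, b):
    """Assumed cost function: 0 for equal labels, 1 otherwise."""
    return 0 if a == b else 1

# T1 = f(d(a, c(b)), e) and T2 = f(c(d(a, b)), e), numbered in postorder.
labels1 = [None, "a", "b", "c", "d", "e", "f"]
left1 = [None, 1, 2, 2, 1, 5, 1]
labels2 = [None, "a", "b", "d", "c", "e", "f"]
left2 = [None, 1, 2, 1, 1, 5, 1]
M = [(1, 1), (2, 2), (4, 3), (5, 5), (6, 6)]  # the two c nodes are untouched
```

In this example M is a valid mapping of cost 2 (one deletion and one insertion of c). Adding the pair (3, 4), which would map c to c, violates the ancestor condition because c is below d in T1 but above d in T2.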

Mappings can be composed. Let M1 be a mapping from T1 to T2 and let M2 be a mapping from T2 to T3. Define
M1 ∘ M2 = {(i,j) | ∃ k s.t. (i,k) ∈ M1 and (k,j) ∈ M2}

Lemma 1:

(1)  M1 ∘ M2 is a mapping
(2)  γ(M1 ∘ M2) ≤ γ(M1) + γ(M2)

Proof:

(1) is clear from the definition of mapping.
(2) Let M1 be the mapping from T1 to T2 and I1 and J1 be the corresponding deletion and insertion sets. Let M2 be the mapping from T2 to T3 and I2 and J2 be the corresponding deletion and insertion sets. Let M1 ∘ M2 be the composed mapping from T1 to T3 and let I and J be the corresponding deletion and insertion sets.

Three general situations occur: (i,j) ∈ M1 ∘ M2, i ∈ I, or j ∈ J. In each case this corresponds to an editing operation γ(x → y) where x and y may be nodes or may be Λ. In all such cases, the triangle inequality on the distance metric γ ensures that γ(x → y) ≤ γ(x → z) + γ(z → y). □

* Note that our definition of mapping is different from the definition in [T-79]. We believe that our definition is more natural because it does not depend on any traversal ordering of the tree.

The relation between a mapping and a sequence of edit operations is as follows:

Lemma 2:

Given S, a sequence s_1, ..., s_k of edit operations from T1 to T2, there exists a mapping M from T1 to T2 such that γ(M) ≤ γ(S).

Proof:

This can be proved by induction on k. The base case is k = 1. This case is correct, because any single editing operation preserves the ancestor and sibling relationships in the mapping. In the general case, let S1 be the sequence s_1, ..., s_(k-1) of edit operations. There exists a mapping M1 such that γ(M1) ≤ γ(S1). Let M2 be the mapping for s_k. Now from Lemma 1, we have the following.

γ(M1 ∘ M2) ≤ γ(M1) + γ(M2) ≤ γ(S).  □

Conversely, every mapping M yields an edit sequence of cost γ(M): delete the nodes in I, change the mapped nodes, and insert the nodes in J.

Hence, δ(T1,T2) = min{γ(M) | M is a mapping from T1 to T2}.

There has been previous work on this problem. Tai [T-79] gave the previous best algorithm for the problem. [Z-83] is an improvement of [T-79], giving the same sequential time as our algorithm. Our new algorithm is simpler than [Z-83], gives a better parallel time, and extends to related problems. The algorithm of Lu [L-79] does not extend to trees of more than two levels.

2.    A simple new algorithm

This algorithm, unlike [T-79], [L-79], and [Z-83], will, in its intermediate steps, consider the distance between two ordered forests. At first sight one may think that this will complicate the work for us, but it will in fact make matters easier.

We use a postorder numbering of the nodes in the trees. In the postordering, T1[1..i] and T2[1..j] will generally be forests as in the following figure. (The edges are those in the subgraph of the tree induced by the vertices.) Fortunately, the definition of mapping for these induced ordered forests is the same as for trees.


[Figure: the forest T[1..7] of a postorder-numbered tree T.]


Notation: Assume that tree T is numbered by postorder. l(i) is the number of the leftmost leaf descendant of the subtree rooted at T[i]. When T[i] is a leaf, l(i) = i. p(i) is the number of the parent of node T[i]. We define p^0(i) = i, p^1(i) = p(i), p^2(i) = p(p^1(i)) and so on. Let L_i be the depth of node i. (We take the depth of the root to be 1.) Let anc(i) = {p^k(i) | 0 ≤ k < L_i}.
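As an illustration of this notation, the quantities l(i), p(i), and anc(i) can be computed in one postorder traversal. The sketch below assumes a tree given as nested `(label, children)` tuples; that representation and the names `number_postorder` and `anc` are choices made for this example only.

```python
def number_postorder(tree):
    """Number the nodes of a (label, children) tree in postorder, from 1.
    Returns labels, left, parent, each with an unused entry at index 0:
    left[i] is l(i) and parent[i] is p(i) (None for the root)."""
    labels, left, parent = [None], [None], [None]

    def visit(node):
        label, children = node
        kids = [visit(child) for child in children]
        labels.append(label)
        left.append(None)
        parent.append(None)
        me = len(labels) - 1
        # A leaf is its own leftmost leaf; otherwise inherit from the first child.
        left[me] = left[kids[0]] if kids else me
        for kid in kids:
            parent[kid] = me
        return me

    visit(tree)
    return labels, left, parent

def anc(i, parent):
    """anc(i): node i followed by its ancestors up to the root."""
    chain = []
    while i is not None:
        chain.append(i)
        i = parent[i]
    return chain

# f(d(a, c(b)), e): postorder gives a=1, b=2, c=3, d=4, e=5, f=6.
example = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, left, parent = number_postorder(example)
```

For node b (number 2) this gives anc(2) = [2, 3, 4, 6], whose length 4 equals the depth of b when the root has depth 1.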

2.1.    A first attempt

We first attempt to solve the problem as it is solved for strings. We try to compute D(T1[1..i],T2[1..j]), where 1 ≤ i ≤ N1 and 1 ≤ j ≤ N2. Let M be a minimum cost mapping between these two forests. There are three cases:

Case 1.    T1[i] is not touched by a line in M.
In this case D(T1[1..i],T2[1..j]) = D(T1[1..i - 1],T2[1..j]) + γ(T1[i] → Λ).

Case 2.    T2[j] is not touched by a line in M.
In this case D(T1[1..i],T2[1..j]) = D(T1[1..i],T2[1..j - 1]) + γ(Λ → T2[j]).

Case 3.    T1[i] and T2[j] are touched by lines in M.
As we explain later, (i,j) must be in M and any node in the subtree rooted at T1[i] can only be touched by a node in the subtree rooted at T2[j].

Hence D(T1[1..i],T2[1..j]) =
D(T1[1..l(i) - 1],T2[1..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]).
Recall that if i > j, then T[i..j] = ∅.

The following figure shows the situation. (The solid lines indicate the pairs of structures whose distances must be calculated.)

[Figure: T1[1..l(i)-1] paired with T2[1..l(j)-1], and T1[l(i)..i] paired with T2[l(j)..j].]


Case 3 expresses the crucial difference between strings and trees. It says that in order to compute the distance between the forests up to T1[i] and T2[j], we need the distance between the subtrees rooted at those nodes. (Failing to recognize this caused Lu's algorithm to come to grief.) Calculating it requires knowing D(T1[l(i)..i - 1],T2[l(j)..j - 1]), which our first attempt did not compute.

In general, we compute D(T1[i1..i],T2[j1..j]), where
i1 ∈ {l(p^0(i)), l(p^1(i)), l(p^2(i)), ..., 1} and
j1 ∈ {l(p^0(j)), l(p^1(j)), l(p^2(j)), ..., 1}.

Note here that for tree T with N nodes, node 1 is l(N) -- the leftmost descendant of the root and the first node visited in the postorder traversal -- which equals l(p^(L_i - 1)(i)), where L_i is the depth of node T[i].


2.2.    New  Algorithm 

We  first  present  three  lemmas  and  then  give  our  new  algorithm. 
Recall that anc(i) = {p^k(i) | 0 ≤ k < L_i}.

Lemma 3:

(1)  If i1 ∈ anc(i), either T1[l(i1)..i - 1] is empty or i1 ∈ anc(i - 1).

(2)  If i1 ∈ anc(i), either T1[l(i1)..l(i) - 1] is empty or i1 ∈ anc(l(i) - 1).

(3)  If i1 ∈ anc(i), either T1[l(i)..i - 1] is empty or i ∈ anc(i - 1).

Proof:

(1). Suppose i1 ∈ anc(i). Because of the postorder numbering, p(i - 1) ∈ anc(i). If i1 ≥ p(i - 1), then either i1 ∈ anc(i - 1) or i1 is in a subtree to the right of p(i - 1). But the second is impossible since i1 and p(i - 1) are both ancestors of i. If i1 < p(i - 1), then T1[l(i1)..i - 1] is empty.

(2). Suppose i1 ∈ anc(i). If l(i1) = l(i), then T1[l(i1)..l(i) - 1] is empty. If l(i1) < l(i), then i1 ∈ anc(p(l(i) - 1)). So, i1 ∈ anc(l(i) - 1). By the postordering and the ancestor assumption, l(i1) > l(i) is impossible.

(3). Suppose i1 ∈ anc(i). If l(i) = i, T1[l(i)..i - 1] is empty. If l(i) < i, then p(i - 1) = i. So, i ∈ anc(i - 1). By the postordering, l(i) > i is impossible.  □

Lemma 4 deals with empty trees or forests.

Lemma 4:

(i)  D(∅,∅) = 0

(ii)  D(T1[l(i1)..i],∅) = D(T1[l(i1)..i - 1],∅) + γ(T1[i] → Λ)

(iii)  D(∅,T2[l(j1)..j]) = D(∅,T2[l(j1)..j - 1]) + γ(Λ → T2[j])

where i1 ∈ anc(i) and j1 ∈ anc(j).

Proof:

(i) requires no edit operation. In (ii) and (iii), the distances correspond to the cost of deleting or inserting the nodes in T1[l(i1)..i] and T2[l(j1)..j] respectively.  □

Lemma 5 deals with the general situation.

Lemma 5:

D(T1[l(i1)..i],T2[l(j1)..j]) = min{
    D(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ),
    D(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j]),
    D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]) }

where i1 ∈ anc(i) and j1 ∈ anc(j).

Proof:

Consider a minimum cost mapping M such that γ(M) = D(T1[l(i1)..i],T2[l(j1)..j]). There are three cases.

Case 1:

T1[i] is not touched by a line in M. In this case
D(T1[l(i1)..i],T2[l(j1)..j]) =
D(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ)

Case 2:

T2[j] is not touched by a line in M. In this case
D(T1[l(i1)..i],T2[l(j1)..j]) =
D(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j])

Case 3:

T1[i] and T2[j] are both touched by lines in M. Since a mapping must preserve ancestor and sibling relationships, they must touch each other, i.e. (i,j) ∈ M. For the same reason, any node in the subtree rooted at T1[i] can only be touched by a node in the subtree rooted at T2[j] and vice versa. Hence we have
D(T1[l(i1)..i],T2[l(j1)..j]) =
D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j])

D(T1[l(i1)..i],T2[l(j1)..j]) is just the minimum of the above three values.  □

We are now ready to give our new algorithm.

Algorithm Basic Distance

begin
    D(∅,∅) = 0

    for i := 1 to N1
        for i1 ∈ anc(i)
            D(T1[l(i1)..i],∅) = D(T1[l(i1)..i - 1],∅) + γ(T1[i] → Λ)

    for j := 1 to N2
        for j1 ∈ anc(j)
            D(∅,T2[l(j1)..j]) = D(∅,T2[l(j1)..j - 1]) + γ(Λ → T2[j])

    for i := 1 to N1
        for j := 1 to N2
            for i1 ∈ anc(i)
                for j1 ∈ anc(j) begin
                    D(T1[l(i1)..i],T2[l(j1)..j]) = min{
                        D(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ),
                        D(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j]),
                        D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]) }
                end;
end

Theorem  1:    Algorithm  Basic  Distance  correctly  computes  tree  distance. 

Proof: 

From Lemma 3 we know that all the distance terms used in the right hand side of the equations have been computed previously. From Lemma 4 and Lemma 5, we know that the formulas used in the above algorithm are correct.  □
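The recurrences of Lemma 4 and Lemma 5 can be tried out directly. The sketch below is a top-down, memoized variant rather than the bottom-up loops of Algorithm Basic Distance; it evaluates the same terms but lets recursion decide the order. The `(label, children)` tuple representation, the function names, and the default unit cost are assumptions made for this example, and the recursion depth limits it to small trees.

```python
from functools import lru_cache

def index_tree(tree):
    """Postorder-number a (label, children) tree from 1.
    Returns labels and left, where left[i] is l(i); index 0 is unused."""
    labels, left = [None], [None]

    def visit(node):
        label, children = node
        leftmost = None
        for child in children:
            child_id = visit(child)
            if leftmost is None:
                leftmost = left[child_id]
        labels.append(label)
        me = len(labels) - 1
        left.append(leftmost if leftmost is not None else me)
        return me

    visit(tree)
    return labels, left

def unit_cost(a, b):
    """Assumed cost: 0 for equal labels, 1 otherwise (None is the empty label)."""
    return 0 if a == b else 1

def tree_distance(t1, t2, cost=unit_cost):
    lab1, l1 = index_tree(t1)
    lab2, l2 = index_tree(t2)

    @lru_cache(maxsize=None)
    def D(a, i, b, j):
        """Distance between forests T1[a..i] and T2[b..j]; a forest with
        its end index below its start index is empty."""
        if i < a and j < b:
            return 0                                          # Lemma 4 (i)
        if j < b:
            return D(a, i - 1, b, j) + cost(lab1[i], None)    # Lemma 4 (ii)
        if i < a:
            return D(a, i, b, j - 1) + cost(None, lab2[j])    # Lemma 4 (iii)
        return min(                                           # Lemma 5
            D(a, i - 1, b, j) + cost(lab1[i], None),
            D(a, i, b, j - 1) + cost(None, lab2[j]),
            D(a, l1[i] - 1, b, l2[j] - 1)
            + D(l1[i], i - 1, l2[j], j - 1)
            + cost(lab1[i], lab2[j]),
        )

    return D(1, len(lab1) - 1, 1, len(lab2) - 1)

A = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
B = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
```

With unit costs, `tree_distance(A, B)` evaluates to 2: delete c below d, then insert c between f and d.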

3.    Some aspects of our algorithm


3.1.    Complexity 

Let us consider the time and space complexity of our algorithm.

By definition of L_i, |anc(i)| = L_i. Therefore the time and space complexity is at most

$$O\Big(\sum_{i=1}^{N_1} L_i \times \sum_{j=1}^{N_2} L_j\Big)$$

If we substitute for L_i × L_j its maximum L1 × L2, where L1 and L2 denote the depths of T1 and T2, we obtain the following.

The time and space complexity is O(N1 × N2 × L1 × L2). This is an improvement over the O(N1 × N2 × L1^2 × L2^2) complexity of [T-79].


3.2.  From  trees  to  strings 

The analysis in the previous section is pessimistic. For special trees such as strings, the depth terms disappear altogether, as can be seen by observing the following. If we define lanc(i) = {l(i1) | i1 ∈ anc(i)}, then we can rewrite the algorithm as follows:

for i := 1 to N1
    for j := 1 to N2
        for i1 ∈ lanc(i)
            for j1 ∈ lanc(j) begin
                D(T1[i1..i],T2[j1..j]) = min{
                    D(T1[i1..i - 1],T2[j1..j]) + γ(T1[i] → Λ),
                    D(T1[i1..i],T2[j1..j - 1]) + γ(Λ → T2[j]),
                    D(T1[i1..l(i) - 1],T2[j1..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]) }
            end;

So, the complexity should be

$$O\Big(\sum_{i=1}^{N_1} |lanc(i)| \times \sum_{j=1}^{N_2} |lanc(j)|\Big)$$

So, if the parent of node i has only i as a child, then lanc(i) = lanc(p(i)). It is not the actual depth of i that matters, therefore, but its "collapsed depth", i.e. the number of distinct leftmost descendants of nodes on the path from i to the root.

In the important extreme case of a string, every node has at most one child, so every node has collapsed depth of 1. Since |lanc(i)| = 1, we obtain time and space complexity of O(N1 × N2).

This is a nice property. This means that our algorithm is not only a generalization of the string algorithm to trees but also that when the input is really a string the complexity is the same as that of the best available algorithm for the general problem of string distance. The algorithms in [T-79] and [Z-83] do not have this property.
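A small computation makes the collapsed depth concrete. The sketch below assumes postorder numbering with `left[i]` = l(i) and `parent[i]` = p(i) stored in lists (index 0 unused, `None` as the parent of the root); the function name `lanc` mirrors the text, while the list layout and the two example inputs are assumptions of this illustration.

```python
def lanc(i, left, parent):
    """lanc(i): the set of leftmost-leaf numbers l(i1) over all i1 in anc(i)."""
    keys = set()
    while i is not None:
        keys.add(left[i])
        i = parent[i]
    return keys

# A string of 5 symbols viewed as a chain: node 1 is the only leaf and
# node i + 1 is the parent of node i, so every node has leftmost leaf 1.
n = 5
chain_left = [None] + [1] * n
chain_parent = [None] + [i + 1 for i in range(1, n)] + [None]
chain_sizes = [len(lanc(i, chain_left, chain_parent)) for i in range(1, n + 1)]

# The tree f(d(a, c(b)), e) in postorder: a=1, b=2, c=3, d=4, e=5, f=6.
tree_left = [None, 1, 2, 2, 1, 5, 1]
tree_parent = [None, 4, 3, 4, 6, 6, None]
tree_sizes = [len(lanc(i, tree_left, tree_parent)) for i in range(1, 7)]
```

For the chain every collapsed depth is 1, while for the example tree node b has depth 4 but collapsed depth 2, because c and b share one leftmost leaf and so do d and f.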

3.3.  Parallel  Implementation 

A transformation of our algorithm to a parallel one has time complexity O(N1 + N2) while [T-79] and [Z-83] have time complexity O((N1 + N2) × (L1 + L2)). Our algorithm uses O(min(N1,N2) × L1 × L2) processors. Our strategy is to compute, in parallel, all distances D(T1[l(i1)..i],T2[l(j1)..j]) for which i + j = k.

We now present the parallel algorithm. (When the PARBEGIN - PAREND construct surrounds one or more for loops, it means that every setting of the iterators in the enclosed for loops can be executed in parallel. The semantics are those of the sequential program ignoring this construct.)

Algorithm Parallel Distance

begin
    D(∅,∅) = 0

    for i := 1 to N1
        PARBEGIN
            for i1 ∈ anc(i)
                D(T1[l(i1)..i],∅) = D(T1[l(i1)..i - 1],∅) + γ(T1[i] → Λ)
        PAREND

    for j := 1 to N2
        PARBEGIN
            for j1 ∈ anc(j)
                D(∅,T2[l(j1)..j]) = D(∅,T2[l(j1)..j - 1]) + γ(Λ → T2[j])
        PAREND

    for k := 2 to N1 + N2
        PARBEGIN
            for i := max(1,k - N2) to min(k - 1,N1) do begin
                j := k - i;
                for i1 ∈ anc(i)
                    for j1 ∈ anc(j) do
                        D(T1[l(i1)..i],T2[l(j1)..j]) = min{
                            D(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ),
                            D(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j]),
                            D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]) }
            end
        PAREND
end

Theorem 2:    Parallel Distance Algorithm is correct and has time complexity O(N1 + N2).

Proof:

The first two initializations are the same as in the Basic Distance algorithm. Let us consider the general case. For i and j within PARBEGIN and PAREND, i + j = k. We now show that all the terms used in the min expressions have been previously computed (so there are no interdependencies among the terms calculated for a given value of k). In the first term i - 1 + j = k - 1 < k. In the second term i + j - 1 = k - 1 < k. In the third term l(i) - 1 + l(j) - 1 ≤ i - 1 + j - 1 = k - 2 < k. In the fourth term i - 1 + j - 1 = k - 2 < k.

Since no sequential loop is executed more than N1 + N2 times, the Parallel Distance algorithm has time complexity O(N1 + N2).  □
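The independence argument in this proof can be checked by simulation. The sketch below runs the diagonal schedule sequentially, but entries computed on diagonal k are made visible only after the whole diagonal is finished, so a read of an entry from the same diagonal would raise a `KeyError`. Everything here is an assumption of the illustration: the `(label, children)` tuples, unit costs (so the distance to an empty forest is just the size of the other forest), and all function names.

```python
def index_tree(tree):
    """Postorder-number a (label, children) tree from 1; index 0 is unused.
    Returns labels, left (l(i)) and parent (p(i), None for the root)."""
    labels, left, parent = [None], [None], [None]

    def visit(node):
        label, children = node
        kids = [visit(child) for child in children]
        labels.append(label)
        left.append(None)
        parent.append(None)
        me = len(labels) - 1
        left[me] = left[kids[0]] if kids else me
        for kid in kids:
            parent[kid] = me
        return me

    visit(tree)
    return labels, left, parent

def lanc(i, left, parent):
    """Leftmost-leaf numbers of node i and all of its ancestors."""
    keys = set()
    while i is not None:
        keys.add(left[i])
        i = parent[i]
    return keys

def wavefront_distance(t1, t2):
    lab1, l1, p1 = index_tree(t1)
    lab2, l2, p2 = index_tree(t2)
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    done = {}  # entries from completed diagonals only

    def D(a, i, b, j):
        if i < a or j < b:
            # Unit costs: delete or insert every node of the non-empty side.
            return max(i - a + 1, 0) + max(j - b + 1, 0)
        return done[(a, i, b, j)]  # KeyError if the schedule were invalid

    for k in range(2, n1 + n2 + 1):
        fresh = {}  # all entries with i + j == k; independent of each other
        for i in range(max(1, k - n2), min(k - 1, n1) + 1):
            j = k - i
            for a in lanc(i, l1, p1):
                for b in lanc(j, l2, p2):
                    fresh[(a, i, b, j)] = min(
                        D(a, i - 1, b, j) + 1,
                        D(a, i, b, j - 1) + 1,
                        D(a, l1[i] - 1, b, l2[j] - 1)
                        + D(l1[i], i - 1, l2[j], j - 1)
                        + (lab1[i] != lab2[j]),
                    )
        done.update(fresh)  # publish diagonal k only after it is complete
    return done[(1, n1, 1, n2)]

A = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
B = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
```

Because each `fresh` dictionary is filled using only `done`, the iterations inside one diagonal could be handed to separate workers; this sketch does not actually run them concurrently.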

4.    k-distance problem


Here  we  show  how  to  solve  a  specialization  of  tree  distance  in  better  time. 

Given two trees T1 and T2 and a number k, we would like to know if the distance δ(T1,T2) is less than k or not and in case it is less than k to give the actual value of δ(T1,T2). For simplicity, we assume that the cost for insert and delete is 1. (An easy generalization to arbitrary costs for insertion and deletion is to take the minimum insert or delete cost c and then replace k by k/c everywhere, including in the complexity measures.)

Though we can use our general tree distance algorithm to solve the above problem, we have a more efficient method. Our algorithm has time complexity O(k × min(N1,N2) × L1 × L2) or O(k^2 × min(N1,N2) × min(L1,L2)). Note that if k is small, this is a big improvement over the Basic Distance Algorithm.

As always, we take our inspiration from strings. The following algorithm is a simple O(k × min(N1,N2)) algorithm for the k-distance problem among strings.

for i := 1 to N1
    for j := max(i - k,1) to min(i + k,N2) begin
        if T1[i] = T2[j]
            then d := 0
            else d := 1;
        D(i,j) = min{ D(i,j - 1) + 1, D(i - 1,j) + 1, D(i - 1,j - 1) + d }
    end;

The idea is that we do not need to compute any D(i,j) such that |i - j| > k. The reason is that for those D(i,j), D(i,j) ≥ |i - j| > k. Hence such terms will not be useful in any later computation.
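The banded computation for strings can be written out as follows. This is a sketch under assumptions of its own: the function name `k_distance` is hypothetical, unit costs are used for all three operations, entries outside the band are treated as k + 1, and the function reports the distance when it is at most k (a slightly looser test than "less than k").

```python
def k_distance(s, t, k):
    """Return the edit distance between s and t if it is at most k,
    otherwise None. Only entries with |i - j| <= k are computed."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None
    big = k + 1  # stands in for every entry outside the band
    # Row 0: D(0, j) = j for the columns inside the band.
    prev = {j: j for j in range(0, min(k, m) + 1)}
    for i in range(1, n + 1):
        cur = {}
        if i <= k:
            cur[0] = i  # D(i, 0) = i while column 0 is inside the band
        for j in range(max(i - k, 1), min(i + k, m) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(
                cur.get(j - 1, big) + 1,   # D(i, j - 1) + 1
                prev.get(j, big) + 1,      # D(i - 1, j) + 1
                prev.get(j - 1, big) + d,  # D(i - 1, j - 1) + d
                big,                       # cap: values above k are not needed
            )
        prev = cur
    result = prev.get(m, big)
    return result if result <= k else None
```

Each row touches at most 2k + 1 columns, which is where the factor k in the running time comes from.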

The difficulty in the tree case is that even if D(T1[1..i],T2[1..j]) is greater than k, D(T1[l(i1)..i],T2[l(j1)..j]) may be smaller than k, where i1 ∈ anc(i) and j1 ∈ anc(j). Our next lemma shows that we don't have to worry about such terms.

Lemma 6:

If D(T1[1..i],T2[1..j]) > k, then in any mapping from T1[1..i'] to T2[1..j'] such that D(T1[1..i'],T2[1..j']) ≤ k, no minimal mapping from T1[l(i1)..i] to T2[l(j1)..j] will be used (i.e. it will not be a submapping).*

Proof:

By contradiction. If a minimal mapping from T1[1..i'] to T2[1..j'] uses any minimal mapping from T1[l(i1)..i] to T2[l(j1)..j], then from the conditions a mapping must follow we know that T1[1..l(i1) - 1] must map to T2[1..l(j1) - 1]. Therefore, D(T1[1..l(i1) - 1],T2[1..l(j1) - 1]) + D(T1[l(i1)..i],T2[l(j1)..j]) ≥ D(T1[1..i],T2[1..j]) > k. This would imply that D(T1[1..i'],T2[1..j']) > k.  □

* This lemma applies to general costs, so we may do better by using it instead of replacing k by k/c as we proposed above.

Now it is easy to see how the algorithm works. We only compute D(T1[l(i1)..i],T2[l(j1)..j]) when |i - j| ≤ k. In the computation if we need to use D(T1[l(i1)..i],T2[l(j1)..j]) such that |i - j| > k, we just substitute the value k + 1. So, the general step of the algorithm becomes:

for i := 1 to N1
    for j := max(i - k,1) to min(i + k,N2)
        for i1 ∈ anc(i)
            for j1 ∈ anc(j) begin
                inner loop computation from Basic Distance Algorithm
            end

The complexity is clearly O(k × min(N1,N2) × L1 × L2). But we can do better. For each T1[l(i1)..i], there are at most 2k terms from T2[l(j1)..j] such that D(T1[l(i1)..i],T2[l(j1)..j]) ≤ k. Therefore we can manage to have an algorithm with complexity O(k^2 × min(N1,N2) × min(L1,L2)).

for i := 1 to N1
    for j := max(i - k,1) to min(i + k,N2)
        for i1 ∈ anc(i)
            for j1 ∈ anc(j)
                such that |(i - l(i1)) - (j - l(j1))| ≤ k
                    inner loop computation from Basic Distance Algorithm

To take advantage of other heuristics, the following lemma is useful.

Lemma 7:

If D(T1[1..i],T2[1..j]) + D(T1[i + 1..N1],T2[j + 1..N2]) > k, then in computing D(T1,T2) ≤ k, D(T1[l(i1)..i],T2[l(j1)..j]) will not be useful.  □

In general, to use Lemma 6 one must compute 2k diagonals whereas using Lemma 7 only k diagonals are needed. Another way to use this lemma is to estimate D(T1[1..i],T2[1..j]) and D(T1[i + 1..N1],T2[j + 1..N2]) using less expensive heuristics such as string matching or label counting and then to disregard unhelpful intermediate mappings.

The parallel time for this algorithm is the same as for the Basic Distance algorithm but only O(k × L1 × L2) processors are needed.

5.    Distance algorithm as a general technique


Many  problems  in  strings  can  be  solved  with  dynamic  programming.  Similarly,  our  algorithm 
not  only  applies  to  tree  distance  but  also  provides  a  way  to  do  dynamic  programming  for  a  variety  of 
tree  problems. 

Here  is  the  general  pattern  (assuming  a  postorder  traversal): 

empty_initialization

for i := 1 to N1
    for i1 ∈ anc(i)
        left_initialization

for j := 1 to N2
    for j1 ∈ anc(j)
        right_initialization

for i := 1 to N1
    for j := 1 to N2
        for i1 ∈ anc(i)
            for j1 ∈ anc(j)
                general_term_computation

In the next two sections, we give four examples of apparently more complex problems that can be solved by the same technique and in the same (serial and parallel) time and space complexity as the Basic Distance algorithm.

6.    Removals at a vertex


6.1.    Remove a single subtree from one tree


In  this  section,  we  consider  the  calculation  of  the  minimum  distance  between  two  trees  with  a 
subtree  removed  from  one  of  them. 


[Figure: a tree T before and after removing the subtree rooted at T[8].]

The problem is as follows: Given trees T1 and T2, we want to know what is the minimum distance between T1 with a subtree removed and T2.

Let DR1(T1[l(i1)..i],T2[l(j1)..j]) denote the minimum distance between T1[l(i1)..i] and T2[l(j1)..j] such that one subtree is removed from T1[l(i1)..i]. The following initialization and general term computation steps will give us an algorithm. Note D() is the distance in the sense of the Basic Distance Algorithm.

Algorithm Single Subtree Removal

empty_initialization:
    DR1(∅,∅) = ∞

left_initialization:
    DR1(T1[l(i1)..i],∅) = min{
        D(T1[l(i1)..l(i) - 1],∅),
        DR1(T1[l(i1)..i - 1],∅) + γ(T1[i] → Λ) }

right_initialization:
    DR1(∅,T2[l(j1)..j]) = ∞

general_term_computation:
    DR1(T1[l(i1)..i],T2[l(j1)..j]) = min{
        D(T1[l(i1)..l(i) - 1],T2[l(j1)..j]),
        DR1(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ),
        DR1(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j]),
        DR1(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]),
        D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + DR1(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]) }

Lemma 8:    Single Subtree Removal algorithm is correct.

Proof:

First, the empty_initialization and right_initialization are correct because no subtree can be removed from an empty tree.

For the left_initialization there are two cases. Either subtree T1[l(i)..i] should be removed, in which case DR1(T1[l(i1)..i],∅) = D(T1[l(i1)..l(i) - 1],∅); or a subtree in T1[l(i1)..i - 1] should be removed, in which case DR1(T1[l(i1)..i],∅) = DR1(T1[l(i1)..i - 1],∅) + γ(T1[i] → Λ).

Hence the left_initialization is correct.

Now let us consider the general_term_computation.

Case (1): subtree T1[l(i)..i] is removed. So, DR1(T1[l(i1)..i],T2[l(j1)..j]) =
D(T1[l(i1)..l(i) - 1],T2[l(j1)..j])

Case (2): subtree T1[l(i)..i] is not removed. Consider the best mapping between T1[l(i1)..i] and T2[l(j1)..j] with one subtree removed from T1[l(i1)..i].

There are three subcases.

subcase 1: i is not in the mapping. In this case,
DR1(T1[l(i1)..i],T2[l(j1)..j]) =
DR1(T1[l(i1)..i - 1],T2[l(j1)..j]) + γ(T1[i] → Λ),

subcase 2: j is not in the mapping. In this case,
DR1(T1[l(i1)..i],T2[l(j1)..j]) =
DR1(T1[l(i1)..i],T2[l(j1)..j - 1]) + γ(Λ → T2[j]),

subcase 3: i and j are both in the mapping.
In this case there are two different situations.

3a: subtree is removed from T1[l(i1)..l(i) - 1]. In this case,
DR1(T1[l(i1)..i],T2[l(j1)..j]) =
DR1(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + D(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j])

3b: subtree is removed from T1[l(i)..i - 1]. In this case,
DR1(T1[l(i1)..i],T2[l(j1)..j]) =
D(T1[l(i1)..l(i) - 1],T2[l(j1)..l(j) - 1]) + DR1(T1[l(i)..i - 1],T2[l(j)..j - 1]) + γ(T1[i] → T2[j]).
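The initialization and general term steps of Single Subtree Removal can be exercised on small inputs. The sketch below evaluates them top-down with memoization instead of in the loop order of the general pattern, and it assumes unit costs, `(label, children)` tuples, and hypothetical function names. It removes exactly one subtree from the first tree, which may be the whole tree.

```python
from functools import lru_cache

INF = float("inf")

def index_tree(tree):
    """Postorder-number a (label, children) tree from 1.
    Returns labels and left, where left[i] is l(i); index 0 is unused."""
    labels, left = [None], [None]

    def visit(node):
        label, children = node
        leftmost = None
        for child in children:
            child_id = visit(child)
            if leftmost is None:
                leftmost = left[child_id]
        labels.append(label)
        me = len(labels) - 1
        left.append(leftmost if leftmost is not None else me)
        return me

    visit(tree)
    return labels, left

def single_removal_distance(t1, t2):
    lab1, l1 = index_tree(t1)
    lab2, l2 = index_tree(t2)

    def change(i, j):
        return 0 if lab1[i] == lab2[j] else 1

    @lru_cache(maxsize=None)
    def D(a, i, b, j):
        """Plain forest distance between T1[a..i] and T2[b..j], unit costs."""
        if i < a or j < b:
            return max(i - a + 1, 0) + max(j - b + 1, 0)
        return min(
            D(a, i - 1, b, j) + 1,
            D(a, i, b, j - 1) + 1,
            D(a, l1[i] - 1, b, l2[j] - 1)
            + D(l1[i], i - 1, l2[j], j - 1) + change(i, j),
        )

    @lru_cache(maxsize=None)
    def DR1(a, i, b, j):
        """Distance with one subtree removed from T1[a..i]."""
        if i < a:
            return INF                      # empty and right initialization
        if j < b:                           # left initialization
            return min(D(a, l1[i] - 1, b, j), DR1(a, i - 1, b, j) + 1)
        return min(                         # general term computation
            D(a, l1[i] - 1, b, j),
            DR1(a, i - 1, b, j) + 1,
            DR1(a, i, b, j - 1) + 1,
            DR1(a, l1[i] - 1, b, l2[j] - 1)
            + D(l1[i], i - 1, l2[j], j - 1) + change(i, j),
            D(a, l1[i] - 1, b, l2[j] - 1)
            + DR1(l1[i], i - 1, l2[j], j - 1) + change(i, j),
        )

    return DR1(1, len(lab1) - 1, 1, len(lab2) - 1)

A = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
```

For example, removing the subtree c(b) from A leaves f(d(a), e), so the removal distance from A to that tree is 0.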


6.2.    Prune  subtrees  from  one  tree 

In this section, we consider a similar problem, the calculation of the minimum distance between two trees with a pruning at a node of one of the trees. By pruning at T[i], we mean removing all the proper descendants of T[i] but keeping T[i] itself. (Thus, a pruning never eliminates the entire tree.)


[Figure: pruning at T[8] -- remove all its proper descendants.]

Formally:  Given  trees  Tj  and  Tt,  we  want  to  know  what  is  the  minimum  distance  between  Ti 
that  has  been  pruned  at  some  node  and  T2. 

A  naive  application  of  our  distance  algorithm  would  require  0(Xi     x    the  time  to  run  the  tree 
distance  algorithm).    We  now  give  a  algorithm  to  do  it  directly. 

We need the following (slightly counterintuitive) definition. A pruning for $T_1[l(i_1)..i]$ can
mean

i) there is a pruning at one node in $T_1[l(i_1)..i]$; or

ii) there is a pruning at $p(i)$, but this is only allowed if all the proper descendants of $p(i)$ are in
$T_1[l(i_1)..i]$.

We denote the condition in (ii) by the predicate arewithin($i, i_1$). This holds if and only if $l(i_1) \le l(p(i))$
and $i$ is the rightmost child of $p(i)$.
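As a small illustration, the predicate can be evaluated from postorder arrays. The sketch below assumes nested `(label, children)` tuples and 1-based postorder numbering; the helper `postorder_arrays` and the convention that the root has parent 0 are assumptions of this example. In postorder, the rightmost child of a node `q` is always `q - 1`, which is what the last test uses.

```python
def postorder_arrays(tree):
    """Return (l, p): leftmost-leaf and parent arrays, 1-indexed in postorder.

    Index 0 is unused; the root's parent is recorded as 0.
    """
    l, p = [0], [0]

    def visit(node):
        _label, children = node
        kids = [visit(child) for child in children]
        i = len(l)                       # postorder number of this node
        l.append(l[kids[0]] if kids else i)
        p.append(0)
        for k in kids:
            p[k] = i
        return i

    visit(tree)
    return l, p


def arewithin(i, i1, l, p):
    """True iff l(i1) <= l(p(i)) and i is the rightmost child of p(i)."""
    parent = p[i]
    return parent != 0 and l[i1] <= l[parent] and i == parent - 1


# Postorder numbers: c=1, d=2, b=3, e=4, a=5
l, p = postorder_arrays(("a", [("b", [("c", []), ("d", [])]), ("e", [])]))
print(arewithin(2, 3, l, p))  # True: T[1..2] holds every proper descendant of b
print(arewithin(2, 2, l, p))  # False: c lies outside T[2..2]
```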

Let $DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$ denote the minimum distance between $T_1[l(i_1)..i]$ and
$T_2[l(j_1)..j]$ such that there is a pruning for $T_1[l(i_1)..i]$.

The following initialization and general term computation steps will give us an algorithm.
Again, $D()$ is the distance in the sense of the Basic Distance algorithm.

Algorithm Single Prune
empty_initialization:

    $DP1(\emptyset, \emptyset) = \infty$

left_initialization:

    $DP1(T_1[l(i_1)..i], \emptyset) = DP1(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$
    if arewithin($i, i_1$) then
        $DP1(T_1[l(i_1)..i], \emptyset) = \min\{$
            $DP1(T_1[l(i_1)..i], \emptyset)$,
            $D(T_1[l(i_1)..l(p(i))-1], \emptyset)\ \}$

right_initialization:

    $DP1(\emptyset, T_2[l(j_1)..j]) = \infty$

general_term_computation:

    $DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = \min\{$
        $DP1(T_1[l(i_1)..i-1],\, T_2[l(j_1)..j]) + \gamma(T_1[i] \to \Lambda)$,
        $DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$,
        $DP1(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + D(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])$,
        $D(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + DP1(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])\ \}$
    if arewithin($i, i_1$) then
        $DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = \min\{$
            $DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$,
            $D(T_1[l(i_1)..l(p(i))-1],\, T_2[l(j_1)..j])\ \}$
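The recurrences above can be transcribed almost literally into a memoized Python function. This is a sketch, not the tabular algorithm with the stated complexity: it assumes nested `(label, children)` trees, unit costs, a forest identified by its left boundary `a = l(i1)` and right end `i`, and a basic forest distance `d` written in the same three-term form that the general term suggests. The names `postorder`, `unit_cost` and `single_prune` are illustrative. Forests that offer no admissible pruning keep the value infinity, as in the initialization.

```python
import math
from functools import lru_cache


def unit_cost(x, y):
    """Cost of editing label x into y; None plays the role of Lambda."""
    return 0 if x == y else 1


def postorder(tree):
    """Return (labels, l, p), 1-indexed in postorder; the root's parent is 0."""
    labels, l, p = [None], [0], [0]

    def visit(node):
        label, children = node
        kids = [visit(child) for child in children]
        labels.append(label)
        i = len(labels) - 1
        l.append(l[kids[0]] if kids else i)
        p.append(0)
        for k in kids:
            p[k] = i
        return i

    visit(tree)
    return labels, l, p


def single_prune(t1, t2, cost=unit_cost):
    """DP1 for the whole trees: distance with one pruning applied to t1."""
    lab1, l1, p1 = postorder(t1)
    lab2, l2, _ = postorder(t2)

    def arewithin(i, a):
        par = p1[i]
        return par != 0 and a <= l1[par] and i == par - 1

    @lru_cache(maxsize=None)
    def d(a, i, b, j):
        """Assumed basic forest distance D(T1[a..i], T2[b..j])."""
        if i < a and j < b:
            return 0
        if j < b:
            return d(a, i - 1, b, j) + cost(lab1[i], None)
        if i < a:
            return d(a, i, b, j - 1) + cost(None, lab2[j])
        return min(
            d(a, i - 1, b, j) + cost(lab1[i], None),
            d(a, i, b, j - 1) + cost(None, lab2[j]),
            d(a, l1[i] - 1, b, l2[j] - 1)
            + d(l1[i], i - 1, l2[j], j - 1) + cost(lab1[i], lab2[j]),
        )

    @lru_cache(maxsize=None)
    def dp1(a, i, b, j):
        if i < a:                       # empty and right initialization
            return math.inf
        if j < b:                       # left initialization
            best = dp1(a, i - 1, b, j) + cost(lab1[i], None)
        else:
            sub = cost(lab1[i], lab2[j])
            best = min(
                dp1(a, i - 1, b, j) + cost(lab1[i], None),
                dp1(a, i, b, j - 1) + cost(None, lab2[j]),
                dp1(a, l1[i] - 1, b, l2[j] - 1)
                + d(l1[i], i - 1, l2[j], j - 1) + sub,
                d(a, l1[i] - 1, b, l2[j] - 1)
                + dp1(l1[i], i - 1, l2[j], j - 1) + sub,
            )
        if arewithin(i, a):             # prune at p(i): T1[l(p(i))..i] disappears
            best = min(best, d(a, l1[p1[i]] - 1, b, j))
        return best

    n1, n2 = len(lab1) - 1, len(lab2) - 1
    return dp1(l1[n1], n1, l2[n2], n2)


# Pruning a(b(c)) at b yields a(b), so the distance to a(b) is 0.
print(single_prune(("a", [("b", [("c", [])])]), ("a", [("b", [])])))  # 0
```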


Lemma  9:   Algorithm  Single  Prune  is  correct. 

Proof: 

First, the empty_initialization and right_initialization are correct because no pruning can occur
on an empty tree.

For the left_initialization there are two cases. If all descendants of $p(i)$ are in $T_1[l(i_1)..i]$, we
can prune at $p(i)$. That means we remove $T_1[l(p(i))..i]$. In this case
$DP1(T_1[l(i_1)..i], \emptyset) = D(T_1[l(i_1)..l(p(i))-1], \emptyset)$. Otherwise, we can prune for $T_1[l(i_1)..i-1]$, giving cost
$DP1(T_1[l(i_1)..i], \emptyset) = DP1(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$.

Hence  the  left_initialization  is  correct. 

Now let us consider the general_term_computation.

First there are two cases:

Case (1): $T_1[l(p(i))..i]$ is removed. So,

$DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = D(T_1[l(i_1)..l(p(i))-1],\, T_2[l(j_1)..j])$.

Note: this case is conditional, depending on whether all the descendants
of $p(i)$ are in $T_1[l(i_1)..i]$, i.e. arewithin($i, i_1$).

Case (2): $T_1[l(p(i))..i]$ is not removed.

Consider the best mapping between $T_1[l(i_1)..i]$ and $T_2[l(j_1)..j]$
with a pruning at a node in $T_1[l(i_1)..i]$.

There  are  three  subcases. 

subcase 1: $i$ is not in the mapping. In this case,

$DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = DP1(T_1[l(i_1)..i-1],\, T_2[l(j_1)..j]) + \gamma(T_1[i] \to \Lambda)$.

subcase 2: $j$ is not in the mapping. In this case,

$DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$.

subcase  3:  i  and  j  are  both  in  the  mapping. 
In  this  case  there  are  two  different  situations. 

3a: There is a pruning for $T_1[l(i_1)..l(i)-1]$. In this case,

$DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = DP1(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + D(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])$.

3b: There is a pruning for $T_1[l(i)..i-1]$. In this case,

$DP1(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = D(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + DP1(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])$.


7.    Approximate  tree  matching

We consider here approximate tree matching. Hoffmann and O'Donnell [HO-82] have proposed
an algorithm for exact tree matching. To generalize the problem, we first consider approximate
matching [S-80, U-83, U-85, LV-86] for strings. The problem is to compute, for each $i$, the
minimum number of editing operations between the "pattern" string PAT[1..|PAT|] and the "text"
string TEXT[1..i] with a prefix removed (from TEXT). (Intuitively, the algorithm finds the
"occurrence" in TEXT that most closely matches PAT.)

To extend this problem to trees, we must generalize the notion of prefix. For us, a prefix will
mean a collection of subtrees. These subtrees can be arbitrary or can arise as a result of zero or
more prunings (section 6.2). We consider each generalization in turn.
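For reference, the string problem that this section generalizes can be sketched with the standard column-by-column dynamic program. This is a minimal illustration with unit costs; the function name `approximate_match` is hypothetical. The only difference from ordinary edit distance is that the empty pattern prefix costs 0 against any text prefix, because the whole text prefix may be removed.

```python
def approximate_match(pat, text):
    """best[i] = min edit distance between pat and text[:i] with a prefix removed."""
    m = len(pat)
    col = list(range(m + 1))          # i = 0: pat[:k] against the empty text costs k
    best = [col[m]]
    for ch in text:
        new = [0]                     # empty pattern prefix: remove all of text[:i]
        for k in range(1, m + 1):
            new.append(min(
                col[k] + 1,                     # delete the text character
                new[k - 1] + 1,                 # insert the pattern character
                col[k - 1] + (pat[k - 1] != ch) # match or substitute
            ))
        col = new
        best.append(col[m])
    return best


best = approximate_match("abc", "xxabcxx")
print(best[5])  # 0: "abc" occurs exactly, ending at position 5 of the text
```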

7.1.    Remove  any  number  of  subtrees  from  TEXT  tree 


The problem is as follows: Given trees $T_1$ and $T_2$, we want to know the minimum distance
between $T_1$ with zero or more subtrees removed and $T_2$.

Let $DR(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$ denote the minimum distance between $T_1[l(i_1)..i]$ and
$T_2[l(j_1)..j]$ with zero or more subtrees removed from $T_1[l(i_1)..i]$.

Algorithm Many Subtree Removal

empty_initialization:

    $DR(\emptyset, \emptyset) = 0$

left_initialization:

    $DR(T_1[l(i_1)..i], \emptyset) = 0$

right_initialization:

    $DR(\emptyset, T_2[l(j_1)..j]) = DR(\emptyset, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$

general_term_computation:

    $DR(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = \min\{$
        $DR(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..j])$,
        $DR(T_1[l(i_1)..i-1],\, T_2[l(j_1)..j]) + \gamma(T_1[i] \to \Lambda)$,
        $DR(T_1[l(i_1)..i],\, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$,
        $DR(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + DR(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])\ \}$
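The four-term recurrence can be sketched as a memoized function over forests. This is an illustration rather than the tabular algorithm with the stated complexity: trees are assumed to be nested `(label, children)` tuples, costs are unit costs, a forest is identified by its left boundary `a = l(i1)` and right end `i`, and the names `postorder`, `unit_cost` and `many_subtree_removal` are hypothetical. The function returns one value per TEXT node, namely the distance between the subtree rooted there and the whole pattern.

```python
from functools import lru_cache


def unit_cost(x, y):
    """Cost of editing label x into y; None plays the role of Lambda."""
    return 0 if x == y else 1


def postorder(tree):
    """Return (labels, l, p), 1-indexed in postorder; the root's parent is 0."""
    labels, l, p = [None], [0], [0]

    def visit(node):
        label, children = node
        kids = [visit(child) for child in children]
        labels.append(label)
        i = len(labels) - 1
        l.append(l[kids[0]] if kids else i)
        p.append(0)
        for k in kids:
            p[k] = i
        return i

    visit(tree)
    return labels, l, p


def many_subtree_removal(text, pat, cost=unit_cost):
    """For each node i of text, DR(text[l(i)..i], pat) with free subtree removal."""
    lab1, l1, _ = postorder(text)
    lab2, l2, _ = postorder(pat)

    @lru_cache(maxsize=None)
    def dr(a, i, b, j):
        if i < a:                                   # T1 forest is empty
            if j < b:
                return 0                            # empty_initialization
            return dr(a, i, b, j - 1) + cost(None, lab2[j])
        if j < b:
            return 0                                # left_initialization
        return min(
            dr(a, l1[i] - 1, b, j),                 # remove the subtree rooted at i
            dr(a, i - 1, b, j) + cost(lab1[i], None),
            dr(a, i, b, j - 1) + cost(None, lab2[j]),
            dr(a, l1[i] - 1, b, l2[j] - 1)
            + dr(l1[i], i - 1, l2[j], j - 1) + cost(lab1[i], lab2[j]),
        )

    n1, n2 = len(lab1) - 1, len(lab2) - 1
    return [dr(l1[i], i, l2[n2], n2) for i in range(1, n1 + 1)]


# Postorder of the text: b=1, d=2, c=3, a=4. The subtree c(d) matches the
# pattern c exactly once d is removed, so the third entry is 0.
text = ("a", [("b", []), ("c", [("d", [])])])
print(many_subtree_removal(text, ("c", [])))  # [1, 1, 0, 1]
```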

Lemma  10:    Algorithm  Many  Subtree  Removal  is  correct. 

Proof:

First we show that the initialization is correct. The empty_initialization and the
right_initialization are the same as in the tree distance algorithm. The left_initialization
$DR(T_1[l(i_1)..i], \emptyset) = 0$ is correct, because we can remove all of $T_1[l(i_1)..i]$.

For the general term $DR(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$, we ask first whether the subtree $T_1[l(i)..i]$
is removed or not. If it is removed, then the distance should be $DR(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..j])$,
giving the first term of the minimization. Otherwise, consider the mapping between $T_1[l(i_1)..i]$
and $T_2[l(j_1)..j]$ after we perform an optimal removal of subtrees of $T_1[l(i_1)..i]$. To compute this
mapping, we have the same three cases as in the tree distance algorithm. Hence the general term
should be the minimum of the above four terms. □


7.2.    Prune  at  any  number  of  nodes  from  the  TEXT  tree 


Given trees $T_1$ and $T_2$, we want to know the minimum distance between $T_1$ and $T_2$
when there have been zero or more prunings at nodes of $T_1$.

Let $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$ denote the minimum distance between $T_1[l(i_1)..i]$ and
$T_2[l(j_1)..j]$ with zero or more prunings for $T_1[l(i_1)..i]$. (Refer to section 6.2 for the definition of
"pruning for".) The following initialization and general term computation steps will give us an
algorithm to solve our problem.

Algorithm Many Prunings

empty_initialization:

    $DP(\emptyset, \emptyset) = 0$

left_initialization:

    $DP(T_1[l(i_1)..i], \emptyset) = DP(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$
    if arewithin($i, i_1$) then
        $DP(T_1[l(i_1)..i], \emptyset) = DP(T_1[l(i_1)..l(p(i))-1], \emptyset)$

right_initialization:

    $DP(\emptyset, T_2[l(j_1)..j]) = DP(\emptyset, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$

general_term_computation:

    $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = \min\{$
        $DP(T_1[l(i_1)..i-1],\, T_2[l(j_1)..j]) + \gamma(T_1[i] \to \Lambda)$,
        $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$,
        $DP(T_1[l(i_1)..l(i)-1],\, T_2[l(j_1)..l(j)-1]) + DP(T_1[l(i)..i-1],\, T_2[l(j)..j-1]) + \gamma(T_1[i] \to T_2[j])\ \}$
    if arewithin($i, i_1$) then
        $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j]) = \min\{$
            $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$,
            $DP(T_1[l(i_1)..l(p(i))-1],\, T_2[l(j_1)..j])\ \}$
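A direct, memoized transcription of these steps might look as follows. It is a sketch under assumptions: nested `(label, children)` trees, unit costs, a forest identified by its left boundary `a = l(i1)` and right end `i`, and illustrative names (`postorder`, `unit_cost`, `many_prunings`). One small simplification is that the arewithin branch takes a minimum in the empty-pattern case as well. The function returns one value per TEXT node, the distance between the subtree rooted there and the whole pattern.

```python
from functools import lru_cache


def unit_cost(x, y):
    """Cost of editing label x into y; None plays the role of Lambda."""
    return 0 if x == y else 1


def postorder(tree):
    """Return (labels, l, p), 1-indexed in postorder; the root's parent is 0."""
    labels, l, p = [None], [0], [0]

    def visit(node):
        label, children = node
        kids = [visit(child) for child in children]
        labels.append(label)
        i = len(labels) - 1
        l.append(l[kids[0]] if kids else i)
        p.append(0)
        for k in kids:
            p[k] = i
        return i

    visit(tree)
    return labels, l, p


def many_prunings(text, pat, cost=unit_cost):
    """For each node i of text, DP(text[l(i)..i], pat) with free prunings."""
    lab1, l1, p1 = postorder(text)
    lab2, l2, _ = postorder(pat)

    def arewithin(i, a):
        par = p1[i]
        return par != 0 and a <= l1[par] and i == par - 1

    @lru_cache(maxsize=None)
    def dp(a, i, b, j):
        if i < a:                                   # T1 forest is empty
            if j < b:
                return 0
            return dp(a, i, b, j - 1) + cost(None, lab2[j])
        if j < b:                                   # left_initialization
            best = dp(a, i - 1, b, j) + cost(lab1[i], None)
        else:
            best = min(
                dp(a, i - 1, b, j) + cost(lab1[i], None),
                dp(a, i, b, j - 1) + cost(None, lab2[j]),
                dp(a, l1[i] - 1, b, l2[j] - 1)
                + dp(l1[i], i - 1, l2[j], j - 1) + cost(lab1[i], lab2[j]),
            )
        if arewithin(i, a):                         # prune at p(i)
            best = min(best, dp(a, l1[p1[i]] - 1, b, j))
        return best

    n1, n2 = len(lab1) - 1, len(lab2) - 1
    return [dp(l1[i], i, l2[n2], n2) for i in range(1, n1 + 1)]


# Postorder of the text: c=1, b=2, a=3. Pruning at b turns a(b(c)) into a(b).
print(many_prunings(("a", [("b", [("c", [])])]), ("a", [("b", [])])))  # [2, 1, 0]
```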

Lemma  11:   Algorithm  Many  Prunings  is  correct. 

Proof: 

First we show that the initialization is correct. The empty_initialization and the
right_initialization are the same as in the tree distance algorithm. For the left_initialization, if we can
remove $T_1[l(p(i))..i]$ (prune at $p(i)$), then $DP(T_1[l(i_1)..i], \emptyset) = DP(T_1[l(i_1)..l(p(i))-1], \emptyset)$.
Otherwise $DP(T_1[l(i_1)..i], \emptyset) = DP(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$. Hence the left_initialization is
correct.

For the general term $DP(T_1[l(i_1)..i],\, T_2[l(j_1)..j])$, we ask first whether $T_1[l(p(i))..i]$ is
removed or not. If it is removed, then the distance should be $DP(T_1[l(i_1)..l(p(i))-1],\, T_2[l(j_1)..j])$,
giving the conditional term of the minimization, which applies only when arewithin($i, i_1$) holds.
Otherwise, consider the mapping between $T_1[l(i_1)..i]$
and $T_2[l(j_1)..j]$ after we perform an optimal number of prunings for $T_1[l(i_1)..i]$. Now we have
the same three cases as in the tree distance algorithm. Hence the general term should be the
minimum of the above four terms. □

Let us now consider the problem of approximate tree matching. In the above algorithms, let $T_1$ be
TEXT and $T_2$ be PAT, and set all costs to 1. We now have an algorithm for approximate tree
matching. The result is in $D(\mathrm{TEXT}[l(i)..i],\, \mathrm{PAT}[1..N_2])$, where $1 \le i \le N_1$. Note that if $i$ is not the only
child of its parent, we need to check if $D(\emptyset,\, \mathrm{PAT}[1..N_2])$ is smaller than D(TEXT[l(i) .. i], PAT[1 ..


8.    Conclusion 

We  present  a  simple  dynamic  programming  algorithm  for  tree  distance  which 

1.  has  better  time  and  space  complexity  than  any  in  the  literature; 

2.  is  efficiently  parallelizable; 

3.  can  be  specialized  to  the  k-distance  problem  for  trees  with  much  improved  efficiency;  and 

4.  is  generalizable  to  approximate  tree  matching  problems. 

Our  research  suggests  two  broad  avenues  for  further  algorithmic  work.  First,  we  would  like  to 
generalize  these  algorithms  to  unordered  trees.  Second,  we  would  like  to  consider  distance  metrics 
other than editing distance.